This web page contains information and pointers to information on parallelization in general.
This section concentrates on the general parallelization methods commonly used. It is important to analyse the problem at hand and to decide which kind of parallelization is to be used, independently of the parallelization software and hardware available. Two important classes have emerged in the history of parallel programming: data parallel and functional parallel.
Data parallel methods
The data parallel methods are essentially based on the distribution of the data among several processors. Usually the processors execute the same kind of code on different pieces of data. Examples can be found in various techniques for matrix multiplication, matrix inversion and so on. Data parallel programming can also offer a solution when the memory of a single processor is not large enough to hold all the data.
A typical data-parallel code has the following structure: distribute the data over the processes, let every process apply the same operations to its own part of the data, and finally combine the partial results.
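A minimal sketch of this structure in C with MPI is given below; the array size, the work() routine and the final reduction are illustrative assumptions, not a prescribed implementation.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000                 /* illustrative global problem size */

    /* the same operation, applied by every process to its own data */
    double work(double x) { return 2.0 * x; }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* distribute the data: each process owns one contiguous slice
           (for simplicity, N is assumed divisible by the number of processes) */
        int chunk = N / size;
        double *slice = malloc(chunk * sizeof(double));
        for (int i = 0; i < chunk; i++)
            slice[i] = rank * chunk + i;

        /* identical code, different data: process the local slice */
        double local_sum = 0.0, global_sum;
        for (int i = 0; i < chunk; i++)
            local_sum += work(slice[i]);

        /* combine the partial results on process 0 */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %f\n", global_sum);

        free(slice);
        MPI_Finalize();
        return 0;
    }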
Depending on the problem at hand, the code required can be difficult to write and maintain. Fortunately, tools are available: ScaLAPACK, for example, is a subroutine library that contains data-parallel versions of frequently used matrix operations.
Functional parallel methods
The functional parallel methods are based on the distribution of the execution of more or less independent tasks over several processors. Many problems can be cast in this form. A good example is Monte Carlo simulation, where the same calculations are performed again and again, each time with different, randomly chosen parameters. The child processes do not have to be identical: each child process can perform different calculations, for instance depending on the input data at hand. In general the farming paradigm is used: one parent process is responsible for allocating work to the child processes and for collecting and processing the answers.
A typical functional parallel program has the following structure, after creation of the parent and child processes:
Parent code: send every child an initial piece of work; then repeatedly receive an answer from any child, process it and send that child the next piece of work, until all work has been handed out; finally send every child a termination message.
Child code: repeatedly receive a piece of work from the parent, perform the calculation and send the answer back, until the termination message arrives.
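A minimal sketch of this farming structure in C with MPI; the number of tasks, the message tags and the compute() routine are illustrative assumptions, and for simplicity it is assumed that there are at least as many tasks as child processes.

    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS   100              /* illustrative number of independent tasks */
    #define TAG_WORK 1
    #define TAG_STOP 2

    /* placeholder for the real child calculation */
    double compute(int task) { return (double)task * task; }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {              /* parent code */
            int next = 0, active = 0;
            double answer;
            MPI_Status st;
            /* give every child an initial task (assumes NTASKS >= size - 1) */
            for (int child = 1; child < size && next < NTASKS; child++) {
                MPI_Send(&next, 1, MPI_INT, child, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            }
            /* collect answers; hand out new work until all tasks are done */
            while (active > 0) {
                MPI_Recv(&answer, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                active--;
                /* process the answer here; keep this as cheap as possible */
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                    next++; active++;
                } else {
                    int stop = -1;
                    MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                }
            }
        } else {                      /* child code */
            int task;
            double answer;
            MPI_Status st;
            while (1) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP)
                    break;
                answer = compute(task);
                MPI_Send(&answer, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }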
In general, functional parallel programs are relatively easy to write and maintain. Furthermore, it is not too difficult to add or remove processors from a running farm, as long as the parent process is kept alive. Load balancing is done automatically, provided the time spent in sending messages is small in comparison with the CPU time used by the child processes. The parent process should use as little CPU time as possible for processing the answers of the child processes.
In the above example it is assumed that the child processes can perform their calculations independently. If this is not the case, the program becomes more complicated and probably some kind of synchronization is needed.
A widely accepted classification of computer architectures is Flynn's classification, dating back to the mid-sixties. The classification is based on the number of different instruction streams and data streams, and distinguishes four classes: SISD, SIMD, MISD and MIMD (single or multiple instruction stream, single or multiple data stream).
Next to Flynn's classification, another subdivision is appropriate: that between shared memory (SM) and distributed memory (DM) systems.
So, in principle, you can have SM-MIMD, DM-SIMD, etc. systems. Things are complicated further by the fact that many SM machines are, on closer inspection, really DM machines with a very fast message passing network, plus firmware and software that present the system to the programmer as an SM system. Also, some software packages (HPF, BSP) present a message passing system to the programmer as an SM system.
Literature: Overview of recent supercomputers, Issue 2010, Aad J. Van der Steen, NCF
Methods available for parallel programming
Here we discuss the various software tools available at HPCE for producing parallel programs: MPI, PVM3 and OpenMP. We also name some higher-level numerical libraries.
MPI is an internationally accepted standard for a message passing library, for use in C and Fortran programs (MPI information). MPI is available for most computer systems and operating systems. MPI-1 is the old standard; a new, upward compatible standard, MPI-2, has been developed. There exist several implementations of MPI, including:
One of the strengths of MPI is that it is a well-defined standard with which all implementations should comply. The fact that there are several implementations has created healthy competition, which has led to very efficient ones. MPI is now the most widely used tool for parallelizing high performance applications on both distributed and shared memory systems. MPI is suitable for implementing both data parallel and functional parallel paradigms.
PVM3 (Parallel Virtual Machine, version 3) is a message passing library to be used in Fortran and C programs. PVM3 is available for most operating systems, including Unix and Windows. PVM3 is suitable for implementing data parallel and functional parallel paradigms. There is no formal standard for PVM3: the standard is effectively defined by the implementation from Oak Ridge National Laboratory.
OpenMP is a standard supporting shared memory multiprocessing in C/C++ and Fortran on most operating systems, including Unix and Windows. OpenMP is defined as an extension to the C/C++ and Fortran languages, so one needs a compiler that supports it. Vendors of shared memory systems often offer OpenMP-aware compilers with auto-parallelizing capabilities.
OpenMP is most suitable for implementing data parallel paradigms.
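As a minimal illustration (the loop body and array names are assumptions, not a prescription), the following C code uses an OpenMP parallel-for directive to distribute the iterations of a data-parallel loop over the available threads; it must be compiled with an OpenMP-capable compiler, e.g. gcc -fopenmp.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000                 /* illustrative array size */

    int main(void)
    {
        static double a[N], b[N];
        double sum = 0.0;

        /* the iterations of this data-parallel loop are divided over the
           available threads; each thread keeps a private partial sum that
           is combined at the end by the reduction clause */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * b[i] + 1.0;
            sum += a[i];
        }

        printf("sum = %f (max %d threads)\n", sum, omp_get_max_threads());
        return 0;
    }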
Often it makes sense to combine MPI with OpenMP, especially if the computer consists of a number of nodes, each equipped with two or more processors or processor cores. MPI is then used to distribute the work between the nodes, while OpenMP is used to distribute the work within a node over the processors.
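A minimal hybrid sketch in C (the problem size and names are illustrative assumptions): MPI divides the data between the processes, typically one per node, and OpenMP divides the node-local loop over the cores.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000                 /* illustrative global problem size */

    int main(int argc, char *argv[])
    {
        int rank, size, provided;
        /* MPI_THREAD_FUNNELED: only the main thread makes MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* MPI level: one slice of the data per process (typically one per node);
           for simplicity, N is assumed divisible by the number of processes */
        int chunk = N / size;
        double *slice = malloc(chunk * sizeof(double));
        double local = 0.0, global;

        /* OpenMP level: spread the slice over the cores of this node */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < chunk; i++) {
            slice[i] = rank * chunk + i;
            local += slice[i];
        }

        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %f\n", global);

        free(slice);
        MPI_Finalize();
        return 0;
    }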
Parallel programs are notoriously hard to debug. Careful insertion of print statements is often the easiest way to pinpoint an error. It is also worthwhile to keep the parallel code as clean and simple as possible.
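For instance, a diagnostic print that is tagged with the rank of the process and flushed immediately keeps the interleaved output of the processes readable (the checkpoint label below is, of course, illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* tag every diagnostic with the rank and flush it immediately,
           so the interleaved output of the processes can be untangled */
        fprintf(stderr, "rank %d: reached checkpoint A\n", rank);
        fflush(stderr);

        MPI_Finalize();
        return 0;
    }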
For debugging parallel programs, the vendors of the systems often provide parallel debugging tools specific to the system at hand. These can sometimes be used to pinpoint the cause of deadlocks, livelocks and other kinds of program misbehavior.
Two important things are to be considered when optimizing a parallel program:
In more detail this means:
When optimizing serial code, the amount of CPU time used can serve as a guideline: the less, the better. When optimizing parallel code, however, the total amount of CPU time used is almost meaningless: the higher the degree of parallelization, the more total CPU time a program will typically use. Instead of CPU time, minimization of wall clock time is the goal.
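A minimal sketch of such a measurement in C with MPI, using MPI_Wtime to time the section of interest (the do_work() routine is an illustrative placeholder):

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    /* placeholder for the real computation */
    void do_work(void) { sleep(1); }

    int main(int argc, char *argv[])
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);      /* start all processes together */
        double t0 = MPI_Wtime();
        do_work();
        MPI_Barrier(MPI_COMM_WORLD);      /* wait for the slowest process */
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("elapsed wall clock time: %.3f s\n", t1 - t0);

        MPI_Finalize();
        return 0;
    }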